An Efficient Hash-based Association Rule Mining Approach for Document Clustering
Abstract
Document clustering is an important research problem in text mining, in which documents are grouped without predefined categories or labels. High dimensionality is a major challenge in document clustering, and some recent algorithms address it by using frequent term sets for clustering. This paper proposes a new methodology for document clustering based on association rule mining. The approach consists of three phases: text preprocessing, association rule mining, and document clustering. An efficient Hash-based Association Rule Mining in Text (HARMT) algorithm is used to overcome the drawbacks of the Apriori algorithm. The generated association rules are used to obtain partitions, and partitions that contain the same documents are grouped together. The resulting clusters are then obtained by grouping the partitions by means of derived keywords. The approach can efficiently reduce the dimensionality of very large text documents and can thus improve the accuracy and speed of the clustering algorithm.
Keywords: Document Clustering, Knowledge Discovery, Hashing, Association Rule Mining, Text Documents, Text Mining.
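The abstract does not describe the internals of HARMT, so the following is only a minimal sketch of the general idea behind hash-based pruning of candidate termsets, not the paper's algorithm. The function names, the bucket count, and the representation of each document as a set of terms are all assumptions made for illustration. Term pairs are hashed into buckets during the first pass; because a bucket count is an upper bound on the support of any pair that falls into it, pairs in sparse buckets can be skipped in the second pass.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs_hashed(docs, min_support, n_buckets=101):
    """Return frequent term pairs, pruning candidates with a hash table.

    docs: list of sets of terms (hypothetical output of preprocessing).
    min_support: minimum number of documents a termset must occur in.
    n_buckets: size of the hash table (an arbitrary illustrative value).
    """
    term_counts = Counter()
    bucket_counts = [0] * n_buckets

    # Pass 1: count single terms and hash every term pair into a bucket.
    for doc in docs:
        term_counts.update(doc)
        for pair in combinations(sorted(doc), 2):
            bucket_counts[hash(pair) % n_buckets] += 1

    frequent_terms = {t for t, c in term_counts.items() if c >= min_support}

    # Pass 2: count only pairs built from frequent terms whose bucket
    # reached min_support; all other pairs cannot be frequent.
    pair_counts = Counter()
    for doc in docs:
        for pair in combinations(sorted(doc & frequent_terms), 2):
            if bucket_counts[hash(pair) % n_buckets] >= min_support:
                pair_counts[pair] += 1

    return {p: c for p, c in pair_counts.items() if c >= min_support}


def association_rules(docs, min_support, min_confidence):
    """Derive rules lhs -> rhs from frequent pairs as (lhs, rhs, support, confidence)."""
    pairs = frequent_pairs_hashed(docs, min_support)
    term_counts = Counter(t for doc in docs for t in doc)
    rules = []
    for (a, b), support in pairs.items():
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / term_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


if __name__ == "__main__":
    sample_docs = [
        {"hash", "mining", "rule"},
        {"hash", "mining"},
        {"cluster", "document"},
        {"cluster", "document", "mining"},
    ]
    for rule in association_rules(sample_docs, min_support=2, min_confidence=0.8):
        print(rule)
```

The pruning never changes the result, only the number of candidates counted: a smaller `n_buckets` causes more collisions and therefore less pruning, while a larger table costs more memory. Documents that satisfy the same rules could then be grouped into partitions, as the clustering phase of the abstract suggests.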
Similar resources
A new approach based on data envelopment analysis with double frontiers for ranking the discovered rules from data mining
Data envelopment analysis (DEA) is a relatively new data-oriented approach for evaluating the performance of a set of peer entities, called decision-making units (DMUs), that convert multiple inputs into multiple outputs. Within a relatively short period, DEA has become a strong quantitative and analytical tool for measuring and evaluating performance. In an article written by Toloo et al. (2009...
Investigate the Performance of Document Clustering Approach Based on Association Rules Mining
The challenges of standard clustering methods and the weaknesses of the Apriori algorithm in frequent termset clustering motivate our research. Based on association rule mining, an efficient approach for Web Document Clustering (ARWDC) has been devised. An efficient Multi-Tire Hashing Frequent Termsets algorithm (MTHFT) has been used to improve the efficiency of mining association...
Clustering Web Documents based on Efficient Multi-Tire Hashing Algorithm for Mining Frequent Termsets
Document Clustering is one of the main themes in text mining. It refers to the process of grouping documents with similar contents or topics into clusters to improve both availability and reliability of text mining applications. Some of the recent algorithms address the problem of high dimensionality of the text by using frequent termsets for clustering. Although the drawbacks of the Apriori al...
Applying a decision support system for accident analysis by using data mining approach: A case study on one of the Iranian manufactures
Uncertain and stochastic states have always been a concern in risk management and accident analysis, as in other fields of industrial engineering, and they complicate managers' decisions when selecting corrective actions and control measures. In this research, huge data sets of the accidents of a manufacturing and industrial unit have been st...
An Efficient Association Rule Mining Using the H-BIT Array Hashing Algorithm
Association Rule Mining (ARM) finds interesting relationships between the presence of various items in a given database. Apriori is the traditional algorithm for learning association rules. However, it suffers from a large number of database scans and the generation of many candidate itemsets. Each level of candidate itemsets requires separate memory locations. Hash Based Frequent Itemsets Quadratic Pro...
Publication date: 2012